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Abstract 


A common strategy for sparse linear regression is to introduce regularization, which elimi¬ 
nates irrelevant features by letting the corresponding weights be zeros. However, regular¬ 
ization often shrinks the estimator for relevant features, which leads to incorrect feature 
selection. 

Motivated by the above-mentioned issue, we propose Bayesian masking (BM), a sparse 
estimation method which imposes no regularization on the weights. The key concept of 
BM is to introduce binary latent variables that randomly mask features. Estimating the 
masking rates determines the relevance of the features automatically. We derive a vari¬ 
ational Bayesian inference algorithm that maximizes the lower bound of the factorized 
information criterion (FIC), which is a recently developed asymptotic criterion for evalu¬ 
ating the marginal log-likelihood. In addition, we propose reparametrization to accelerate 
the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso 
and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off. 
Keywords: Sparse estimation. Factorized information criterion. Lasso, Automatic rele¬ 
vance determination 

1. Introduction 

In sparse linear regression, various approaches impose sparsity by implementing regulariza¬ 
tion on a weight parameter. For example, Lasso (Tibshirani, 1994) introduces sparsity by 
regularizing the weights by LI norm. Automatic relevance determination (ARD) (MacKay, 
1994; Neal, 1996) regularizes the weights by a prior distribution, with hyperparameters 
indicating the relevance of the input features. Empirical Bayes estimation of the hyper¬ 
parameters thus eliminates irrelevant features automatically. Although ARD is notorious 
for its slow convergence, several authors have improved the algorithm (e.g., (Wipf and 
Nagarajan, 2008)). 

The trade-off between sparsity and shrinkage is a crucial issue in sparse regularization 
methods (Aravkin et ah, 2014). In Lasso, for example, a large regularization constant 
incorporates strong sparsity and is more likely to estimate the weights of irrelevant features 
as zero, which is desirable in terms of interpretability. However, it also shrinks the weights 
of relevant features and may eliminate them. ARD suffers from the same problem, although 
the bias of ARD is weaker than that of Lasso (Aravkin et ah, 2014). Because both sparsity 
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and shrinkage are caused by regularization, the shrinkage effects are inevitable as long as 
we use the regularization scheme. 

To address this issue, we propose an alternative method for sparse estimation, namely, 
Bayesian masking (BM), which differs from existing methods in that it does not impose any 
regularization on the weights. Our contributions can be summarized as follows. 

• The BM model (Section 4). The BM model introduces binary latent variables into 
a linear regression model. The latent variables randomly mask features to be zero 
at each sample according to the priors that are defined for each feature, but shared 
among samples. The estimation of the priors on the masking rates determines the 
relevance of the features. 

• A variational Bayesian inference algorithm for the BM model (Section 4.2). The EM- 
like coordinate ascent algorithm maximizes the lower bound of the factorized infor¬ 
mation criterion (FIC). The convergence of the algorithm is accelerated by combining 
gradient ascent and reparametrization (Section 4.4), which are motivated by previous 
studies on convergence analysis of coordinate ascent and information geometry. 

• Analytic forms of the one-dimensional (ID) estimators of Lasso, ARD (Section 3), 
and BM (Section 4.3). The analytic estimators of these methods provide insights into 
their shrinkage mechanisms. 

Through numerical experiments, we empirically show that the proposed method outper¬ 
forms Lasso and ARD in terms of the sparsity-shrinkage trade-off. 

1.1. Notation 

Hereafter, denotes a column vector of the n-th row of a matrix X. 

2. Background 

2.1. Linear Regression and Least Squares 

Consider a linear regression model: 


( 1 ) 


y = Xf3 + e, 


where y G is a vector of target values, X G is a matrix of explanatory variables, 

(3 G is a vector of weight parameters, and e ~ ^”(0, A“^/) denotes observation noise. 
Further, N is the number of samples and K is the number of features. Because the noise 


is i.i.d. Gaussian, the maximum likelihood estimator (MLE) is given as the least-squares 
(LS) solution: 


/3ls = argmin ^ \\y - XfiWl = {X'^Xy^X^y. 



( 2 ) 


2 


2.2. Lasso 

Lasso is formulated as the Ll-penalized regression problem, and the estimator is given as 

/^Lasso = argmin^lly - XI3\\l + a||/3||i, (3) 

where a{> 0 ) is a regularization constant. 

2.3. ARD 

Consider the prior distribution of (3: 

p(/3|7) = iV(/3|0,r), (4) 

where T = diag( 7 ) and 7 is the hyperparameter determining the variance of the prior. 
ARD determines 7 by the empirical Bayes principle, i.e., by maximizing the marginal log- 
likelihood: 7 = argmax.^ f p(y|/ 3 )p(/ 3 | 7 , A)d/3. Then, the estimator of (3 is then often given 
by the posterior mean with plugged-in 7 (Wipf and Nagarajan, 2008; Aravkin et al., 2014): 

0ARD = rx^{x-^i + xrx^)-^y. ( 5 ) 

Clearly, for > 0, 7 ^ = 0 results in = 0 for any k. 


3. Trade-off between Sparsity and Shrinkage 

Concerning the trade-off between sparsity and shrinkage, Aravkin et al. (2014) derived the 
upper bounds of the estimators for K = 2 and showed that undesirable shrinkage occurs for 
both Lasso and ARD; specihcally, the shrinkage bias of Lasso is larger than that of ARD 
when an unnecessary feature is correctly pruned. 

In this section, we revisit the above-mentioned issue. To understand how sparse regu¬ 
larization works, we derive the exact forms of the estimators for K = 1 and show that ARD 
is better than Lasso in terms of the sparsity-shrinkage trade-off. Although our analysis is 
much simpler than the earlier study by Aravkin et al. (2014), it is meaningful because our 
derived estimators 

• are exact and analytically written (no approximation is needed) and 

• highlight the significant differences between ARD and Lasso. 


3.1. ID Estimators 

LS When K = 1, the matrix inverse in Eq. (2) becomes a scalar inverse and /3 ls is simply 
written as 


/3ls = 


x^y 

T ’ 

X ' X 


where we let x represent X to emphasize the dimensionality. 


( 6 ) 
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Figure 1: Illustrative comparison of the shrinkage effect with Lasso and ARD in the ID 
case. Lasso shifts the estimator to zero by the same value, whereas the shrinkage 
of ARD is larger when the LS estimator is smaller. 


Lasso For ,0 > 0, the Ll-penalty becomes af3. Thus, Eq. (3) yields a stationary point 
Pls — X where /3 ls = x~^y/x^x. For /? < 0, the solution is the same except that the 
sign is reversed. Combining both cases yields the solution for all (3: 

~ - -- 2a 

/^Lasso = sign(^Ls) max(0, |/3 ls| - (7) 

ARD Similar to Lasso, the ID estimator of ARD is analytically written as^ 

/3ard = sign(^Ls) max(0, |/3 ls| - \ , )■ (8) 

A\x y\ 

Note that we assumed that A is known. ^ 

3.2. Comparison of LS, Lasso and ARD 

Although their regularizations are different. Lasso and ARD have the same shrinkage mech¬ 
anism — subtracting the constant from /3ls and cropping /0ls to zero if its magnitude is 
smaller than that of the constant. Since the constants in Eqs. (7) and (8) are both larger 
than zero except for the noiseless case (i.e., A = oo), the shrinkage bias is inevitable in both 
Lasso with a > 0 and ARD. On the other hand, this retraction to zero is necessary for 
sparsity because it prunes the irrelevant features. 

It is worth noting that the bias of ARD is much weaker than that of Lasso when the 
scale of /3ard is large. This is easily confirmed by transforming the constant in Eq. (8) 
as (A|a;''~y|)“^ = (A£c''~a;|/3Ls|)~^) which indicates that the shrinkage is weak when |/3ls| is 
large but strong when |/3ls| is close to zero. Compared to Lasso, this behavior of ARD is 
desirable, as it maintains sparsity for weak features while alleviating unnecessary shrinkage. 
Figure I shows how shrinkage occurs in Lasso and ARD. 

1. The full derivation of Eq. (8) is shown in Appendix A. 

2. Practically, A is set to the unbiased version of MLE (Aravkin et ah, 2014). 
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4. BM Model and Inference Algorithms 

4.1. BM Model 

Obviously, the shrinkage bias of Lasso and ARD comes from their imposition regularization 
on the weights. For example, if a = 0 in Lasso, the loss function becomes equivalent to that 
of LS, and of course, no shrinkage occurs. Using 7 = oo yields the same result in ARD. 

Hence, we introduce a new estimation method that maintains sparsity by using latent 
variables instead of regularization. Let Z G {0, be binary latent variables having 

the same dimensionality of X. We insert Z between X and jS, i.e., 

y = {XoZ)(3 + e, (9) 

where ’o’ denotes the Hadamard product (i.e., element-wise multiplication). 

Z masks the features randomly at each sample. For Bayesian treatment, we introduce 
prior distributions. We assume that Z follows a Bernoulli prior distribution as Znk ~ 
Bern( 7 rfc), where indicates how feature k is relevant. Then, estimating the priors on the 
masking rates automatically determines the relevance of the features. 

We also introduce the priors for (3 and A; however, we set them to be as weak as possible 
so that their effects are negligible when N is sufficiently large. Thus, we simply employ them 
as constants as they do not depend on A, i.e., logp(/3. A) = 0(1). 

4.2. FAB-EM Algorithm 

Our approach is based on the concept of Bayesian model selection; the central task is 
to evaluate the marginal log-likelihood. However, in our case, the marginal likelihood is 
intractable. 

Thus, we adopt FIC, a recently proposed approximation for the marginal log-likelihood (Fu- 
jimaki and Morinaga, 2012; Hayashi and Fujimaki, 2013; Hayashi et ah, 2015). We also 
adopt the factorized asymptotic Bayesian inference (FAB), which provides a tractable al¬ 
gorithm for parameter inference and model pruning by optimizing the lower bound of FIC. 
The FAB algorithm alternately maximizes the lower bound in an EM-like manner. 

To obtain the lower bound of FIC, we introduce a mean-field approximation on the 
posterior distribution of Z as q{zn) = Wk^i^nk) = Oa: Then we obtain the 

objective function as 

^({Rn},/3, A, 7 r) = Eq[logp(y|A, Z,/3,A)]-|-Eg[logp(Z| 7 r)] 

- 1 ^ (log(jv^i) + SaAMZL^) - LD log iv 

+ '^H{q{znk)), ( 10 ) 

n,k 

where Eg means the expectation under q = q{Z), and H is the entropy. The derivation of 
Eq. (10) is described in Appendix B. 
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FAB E-step In the FAB E-step, we update By taking the gradient of Q w.r.t. 

Unk and setting it to zero, we obtain the following fixed-point equations: 


l^-nk — 


(^Cnk + log 


'^k 

1 - TTfc 



( 11 ) 


where a{-) is the sigmoid function and Cnk = XnkPkKUn-lxnkPk-'^i^k ^^nlXnl|3l)■ Updating 
Eq. (11) several times give us {link} at a local maximum of Q. 


FAB M-step Since only the first and second terms in Eq. (10) are relevant, the EAB 
M-step is equivalent to the M-step in the EM algorithm. We have the closed-form solutions 
to update /3, A, and tt as 


/3 

1 

A 


n-^{X o uyy, 

~ ‘^UniXn O t^n) P + (®n ° P) ^qi^n^n ](®n ° P)) 


E ^^nk 


N 


( 12 ) 

(13) 

(14) 


where U = o EqlznzJ] and M = (pi,..., pjv)"''- We note that Eq[znz'^] = 

ynfXn + diag(/.i„ - fXnO y-n)- 

Pruning step As noted in previous papers, the third term in Q penalizes {ynk} and 
automatically eliminates irrelevant features. Then, when tta, = < 6, we remove 

the corresponding feature from the model. The pseudo-code of the resulting algorithm is 
given in Algorithm 1. 


Algorithm 1 Bayesian masking by FAB-EM algorithm 
1: Initialize ({pn}, /3) A, tt) 

2: repeat 

3: Update {ynk} by Eq. (11) 

4: for k = 1 ,..., K do 

5: if Z)n Ainfc/lV < then 

6: Remove fc-th dimension from the model 

7: end if 

8 : end for 

9: Update (/3, A,7r) by Eqs. (12-14) 

10: until termination criterion is met 


4.3. Analysis of FAB Estimator 

As in Section 3, we investigate the FAB estimator of p. If y follows the linear model (1) 
with P = P*, the FAB estimator is expectedly obtained as 

Ee[/3FAB] = n-^X^Xp* 

= p* + n-^b (15) 
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for any q{Z), where X = X o M and bk = {xk o Yli^k o (1 - 

Eq. (15) immediately suggests that /3fab is biased by Q~^b. The bias consists of the 
two cross terms: {xk,xi) and (/Xfc, 1 — fii). Thus, the bias increases when Xk and xi are 
correlated and Zk and zi are negatively correlated. On the other hand, the cross terms 
become zero when Unk = 0 or Uni = 1 for all n. This also implies that the bias is weakened 
by an appropriately estimated q. In Section 5.2, we numerically evaluate the bias of FAB 
with Lasso and ARD, showing that FAB achieves the lowest bias. 

Remarkably, when AT = 1, no cross term appears and the bias vanishes for any q 
satisfying vr > 0. Furthermore, the ID estimator is simply written as 

Pfab = ^. (16) 

Again, if fj.nl = 1 for all n, /3fab recovers /3 ls- 

4.4. FAB-EG and Hybrid FAB-EM-EG Algorithms 

Hayashi and Fujimaki (2013) reported that model pruning in FAB is slow, and we find that 
our algorithm suffers from the same drawback. In our case, {vr^.} for irrelevant features 
requires many iterations until convergence to zero although the weights (3 and {vta,} for rel¬ 
evant features converge rapidly. To overcome the problem of slow convergence, we combine 
gradient ascent and reparametrization as discussed below. 

First, we replace the FAB-M step by gradient ascent, which is motivated by previous 
studies (Salakhutdinov et ah, 2003; Maeda and Ishii, 2007) on convergence analysis of the 
EM algorithm. Thus, we find that gradient ascent certainly helps; however, the convergence 
remains slow. This is because the model distribution is insensitive to the direction of 
when /3fc is small. This means that the gradient for vr^ would be shallow for an irrelevant 
feature k, since the estimator of Pk takes a small value for the feature. 

The natural gradient (NG) method (Amari, 1998) can effectively overcome the insensi¬ 
tivity in the model distribution. The key concept of the NG method is to adopt the Fisher 
information matrix as a metric on a parameter space (the so-called the Fisher information 
metric) in order to define the distance by the Kullback-Leibler (KL) divergence between 
model distributions instead of the Euclidean distance between parameter vectors. Thus, pa¬ 
rameter updates by the NG method can make steady changes in the model distribution at 
each iteration. Amari and his co-workers showed that the NG method is especially effective 
when the parameter space contains singular regions, i.e., regions where the Fisher informa¬ 
tion matrix degenerates (Amari et ah, 2006; Wei et ah, 2008). This is because the problem 
of shallow gradient is severe around a singular region, since gradient completely vanishes 
along with the region. Hence, learning by ordinary gradient is often trapped around singu¬ 
lar regions and remains trapped for many iterations, even when the models in the regions 
show poor agreement with data. In contrast, learning by the NG method is free from such 
a slow down. 

In our case, the model has two singular regions for each feature, i.e., = 0 and 

TTfc = 0. In particular, singular region = 0 causes the slow convergence. Hence, the 
NG method should be effective; however, the evaluation of the Fisher information metric is 
computationally expensive in our case. Therefore, as an alternative, we propose a simple 
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reparametrization that approximates the Fisher information metric. Our strategy is to 
perform a block diagonal approximation of the full matrix, as has been performed recently 
in (Desjardins et ah, 2015) for neural networks. 

Toward this end, we examine the Fisher information metric in the single-parameter case 
{K = 1), which approximates diagonal blocks of the full metric. Then, the model of interest 
here is simply ~ 7rA^(y„|xn/3, A“^) + (1 — 7r)N{yn\0, A“^). We focus on the case of small 
/3 because the slow learning of tt is prominent in this region as noted above. Although the 
exact form of the metric is difficult to compute, our focus on the small-/? case allows further 
approximation. Taylor expansion around /? = 0 and lowest-order approximation give us a 
metric tensor: 


( Gw Gw -N ^ / \l3^N(^-) f{n) + lh\ _ 

\ ^/37r ^TTTT / \ H J 


(17) 


where /(vr) represent polynomials of vr and (x^) = factor in G is 

G-rtTv ot /?^ because this represents the fact that a smaller value of /? makes the model more 
insensitive to changes in vr. 

The approximated metric obtained above is complicated and difficult to handle. Hence, 
we consider the following simple reparametrization for k = 1,K: 


iPk, VTfc) {/3k, Sk = /SfcTTfc). (18) 

Let us show what metric is introduced in (/?, 7r)-space through the reparametrization. To¬ 
ward this end, we recall that considering usual gradient ascent on (/?, s)-space corresponds 
to introducing a metric J in the original space, where J is the Jacobian matrix for the 
reparametrization. Then, the reparametrization corresponds to introducing a block diagonal 
metric tensor in which each diagonal block is given by 


G' = 


1 + ^fc 
/JfcTTfc (31 


I3k'^k 


(19) 


The similarity between G and G' is clear. Although they are not identical, G' is simpler 
and shares the key factor as G'^^ oc /3^. 


FAB-G step In the FAB-G step, we replace the FAB-M step by gradient ascent. We 
calculate the gradient on the parameter space after reparametrization {/3k, Sk), and project 
it onto the original space. Then, we obtain the following update rule for (/3fc,vrfc): 



where t is the number iterations and {rjt} are the learning coefficients. We note that the 
update of vrfc by is scaled by and then accelerated when /3k is small, as expected. 
When TTfc = 1, we update only /3k by Since the singularity problem comes from /3 and 

TT, not A, updating A retains the closed-form solution Eq. (14). 

We refer to the FAB algorithm as the FAB-EG algorithm when the G step replaces the M 
step. Even though the EAB-EG algorithm shows faster convergence, the fast initial progress 
in the FAB-EM algorithm remains an attractive feature. Thus, to exploit both benefits, we 
propose a hybrid algorithm in which the learning progresses by EAB-EM initially and by 
EAB-EG later. The pseudo-code of the hybrid algorithm is given in Algorithm 2. 



Algorithm 2 Bayesian masking by Hybrid FAB-EM-EG algorithm 
1: Initialize ({/^n}, /3) A, tt) 

2: t i — 0 
3: repeat 

4: Update {nn} by FAB-E step 

5: for k = 1,K do 

6 : a< 6 then 

7: Remove fc-th dimension from the model 

8 : end if 

9: end for 

10: if t < T then 

11 : Update (/3,A, tt) by FAB-M step 

12: else 

13: Update (/3, A, tt) by FAB-G step 

14: end if 

15: t i — t “t“ 1 

16: until termination criterion is met 


5. Experiments 

5.1. Overview 

First, we evaluate BM (i.e., the proposed method), Lasso, and ARD with a simple example 
where K = 2, and we show that BM outperforms the other two in terms of the sparsity- 
shrinkage trade-off. Second, using the same example, we show how the parameters are 
updated by the FAB-EM and FAB-EG algorithms. Specifically, we highlight how the use of 
gradient ascent and the introduction of the reparametrization help to avoid being trapped 
in the singular region. Third, we demonstrate that the hybrid FAB-EM-EG algorithm 
converges faster than the FAB-EM algorithm. Finally, we evaluate BM, Lasso, and ARD 
again using larger values of K. 

5.2. Experiment with Two Features 

For demonstration purposes, we borrowed a simple K = 2 setting from Aravkin et al. (2014) 
who considered that 


yi 

1/2 




( 21 ) 


where ei and 62 are sampled from A(0,0.005). Assume that the true parameter values are 
Pi = 0 and P 2 = 1 , i-e., the first feature is irrelevant and the second feature is relevant. 
Note that the variance is supposed to be known for simplicity. 


5.2.1. Comparison of BM, Lasso, and ARD 

In our setting, we generated 500 datasets, each containing 2 x 20 samples of y. Note that, 
in BM (Algorithm 1), we adopted zero tolerance for model pruning: we pruned the A;-th 
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Figure 2: Estimation results on synthetic data from Eq. (21). (Left) Box plots of estimated 
values of (32■ The green line indicates the true value ^02 = 1- (Right) Erequency 
of pruning of the irrelevant feature. The error bars represent the 95% confidence 
interval from fitting of the binomial distribution. 





feature only when was smaller than the machine epsilon. In Lasso, we determined a by 
2-fold cross validation. 

The estimation results are summarized in Figure 2; the left panel shows /32 when the 
irrelevant feature was pruned and the right panel shows the frequency of pruning of the 
irrelevant feature in the 500 trials. Note that the relevant feature was not pruned in any of 
the methods. We can easily see that BM achieved the highest sparsity without shrinkage of 
/32. On the other hand, ARD displayed no visible shrinkage as in BM; however, its sparsity 
was lower than that of BM. Lasso displayed shrinkage bias and the lowest sparsity. 

5.2.2. Learning Trajectory of FAB-EM and FAB-EG Algorithms 

Using the same simple example, we show how the parameters are updated by the FAB- 
EM and FAB-EG algorithms. Eor comparison, we also performed the FAB-EG algorithm 
without reparametrization. We fixed the learning coefficient as rjt = 2 x 10“® for FAB-EG 
and = 2 X 10“^ for FAB-EG without reparametrization. 

Figure 3 shows typical learning trajectories of /3i and tti with 100 samples. We considered 
the 10 initial points of /3i and tti located diagonally in the upper right. The initial values 
of /32 and 712 are set to the true values. In FAB-EM, the learning trajectories were trapped 
around /3i = 0. In FAB-EG without reparametrization, the trapping was mitigated but 
still occurred, especially when the initial values of /3i were small. Intuitively speaking, 
this is because the gradient for vri is shallow with small /3i. Thus, the learning trajectory 
approached to smaller /3i since the feature was irrelevant, and then, the learning of tti 
became slower. In contrast, the learning trajectories of EAB-EG with the reparametrization 
approached to vri = 0 with fewer iterations regardless of the initial points, which means that 
the irrelevant featnre was pruned quickly. This result empirically demonstrates that using 
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Figure 3: Typical learning trajectories on Pi-tti plane from 10 different initial points by 
FAB-EM steps and FAB-EG steps with/without reparametrization. 


gradient ascent alone improves the convergence only slightly, but combining it with the 
reparametrization accelerates the convergence sharply. 

5.3. Experiment with Larger Number of Features 

Next, we explain the resnlts with larger examples (10 < K < 100). We generated (3 and X 
from the uniform distribution in [0,1], and half of the elements in /3 were set as zero, y was 
generated by Eq. (1) with = 0.2. N was set as 20Ar. We controlled {yt} as described 
in Appendix C. 

5.3.1. Performance Validation of Hybrid FAB-EM-EG algorithm 

With K = 50, we demonstrate that the hybrid FAB-EM-EG algorithm converges faster 
than the EAB-EM algorithm. Toward this end, we counted the nnmber of correctly pruned 
features and plotted it against the elapsed time for the algorithms. We set T in Algorithm 2 
to 200 iterations. Fignre 4 shows the number of correctly pruned featnres against the elapsed 
time. We can clearly see the faster convergence of the hybrid FAB-EM-EG algorithm. 
Note that the faster convergence is not attributable to over-pruning because the number of 
wrongly prnned features at termination were 2.0 ± 1.3 (hybrid FAB-EM-EG) and 1.8 ± 1.3 
(FAB-EM-EG), which were nearly equal. 

5.3.2. Precision and Recall 

Por K > 2, we used two performance measures. Recall and Precision, defined as Precision 
= 772 - 3/7712 and Recall = msjmi, where mi and m 2 are the numbers of true and estimated 
irrelevant features, respectively, and m 3 is the number of correctly pruned featnres. We 
examined K = 10, 30, 50, and 100, and for each K, we generated 100 datasets. Pigure 5 
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Figure 4: Performance validation of the hybrid FAB-EM-EG algorithm. Eor the hybrid 
FAB-EM-EG and the EAB-EM algorithms, the number of correctly pruned fea¬ 
tures is plotted against the elapsed time. Different lines of the same color repre¬ 
sent different datasets. 


summarizes the estimation results for BM (Algorithm 2), Lasso, and ARD. We set the 
algorithm switching point T as 500 iterations. In Lasso, a was determined by 10-fold cross 
validation. As shown in the left and middle panels, although BM displayed slightly lower 
Precision than the others, it achieved the highest Recall. We also computed the Fi score, the 
harmonic mean of Precision and Recall, which can be interpreted as a metric for evaluating 
the performance in terms of the sparsity-shrinkage trade-off. As shown in the right panel, 
BM attained the highest Fi score for all K values in this range. Thus, we concluded that 
BM achieved the best performance for the larger values of K. 

6. Conclusion 

In this paper, we proposed a new sparse estimation method, BM, whose key feature is that it 
does not impose direct regularization on the weights. Our strategy was to introduce binary 
latent variables that randomly mask features and to perform a variational Bayesian infer¬ 
ence based on FIG. In addition, we introduced gradient ascent and the reparametrization 
to accelerate the convergence. Our analysis of the estimators of BM, Lasso, and ARD high¬ 
lighted how their sparsity mechanisms are different from one another. Finally, experimental 
comparisons of the three methods demonstrated that BM achieves the best performance in 
terms of the sparsity-shrinkage trade-off. 

Note that augmenting a statistical model by random masking variables itself is not a new 
idea. For example, van der Maaten et al. (2013) used random masking to generate virtual 
training data. However, our approach is distinguished from those studies by its purpose. 
Namely, we aim to identify whether the features are relevant or not, rather than improving 
prediction performance. In the augmented model, the FAB algorithm penalizes the masking 
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Figure 5: Performance measures are plotted with standard errors: (Left) Precision, (Mid¬ 
dle) Recall, and (Right) Fi score. 


rates, i.e., existence probability of the features, unlike the sparse regularization techniques 
where the weight values of the features are penalized. Applying the BM to real-world tasks 
where model identification is crucial, e.g., causal network inference, is a promising future 
work. 
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Appendix A. The One-Dimensional ARD Estimator 

According to Wipf and Nagarajan (2008), the negative marginal log-likelihood when K = 1 
is given by 

log |A“^/ -|- ^xx^\ + 'if' {\~^I + (22) 

= log{X~^+-fx'^x) + Xy'^y-y'^ ^ y (23) 

1 -I- X^x ' X 

= log(A~^ -k -fx'^x) -F Xy^y - . (24) 

A +^x' X 
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In the second line, we use the matrix determinant lemma (Petersen and Pedersen, 2012, 
Eq. (24)) for the first term and the variant of the Sherman-Morrison relation (Petersen and 
Pedersen, 2012, Eq. (160)) for the second term. The derivative is 


d Eq. (24) X \{x^\^{x^y^x^x 

d'f \~^ +'^x~^ X +'^x~^ X +'^x~^ xY 

x~^ x{X~^ + ^x~^ x) — X{x~^yY{X~^ + jx'^x) + X'y{x~^ yYx~^ x 

(A“^ + ^x~^xY 
X~^x'^x + ‘^{x'^xY — {x'^ yY 
(A“^ + ^x'^ xY 


The stationary point is then given as 


{x~^yY-X ^x~^x^ 
7 = max 0 , ——7 -p 

yx ' xY 


= max( 0 , /3ls “ {Xx~^x) ^). 


(25) 

(26) 

(27) 

(28) 
(29) 


Note that we use the max operator since 7 is the variance and it must be non-negative. By 
substituting the result of 7 > 0 into Eq. (5), we obtain 


/3ard = 7®"^(A 7®®"^) ^y 

X^'^x~^ xx~^ y 


= Xjx'^y - '-Yj 


X~^ -|- Yx'x 


XPlf^x'^xx'^y - 2Yhx ' y 


= - /^LS - 


'2 


Aa?~y 

X'^x^x 


= A/3£s®"^2/ - /3ls - 


= -/Ils + 2 / 3 ls - 


A“^ + ~ 

Xpfox'^xx'^y - 2Yhx'^y + A-^/3ls 


Phx'^x 


1 


XP-LSX'X 


= / 3 ls- 


1 


Xx^y 


(30) 

(31) 

(32) 

(33) 

(34) 

(35) 


Recall that /3 ard = 0 when 7 < 0. Since /3£g and {Xx~^x)~^ are both non-negative, 
the condition 7 < 0 is written as < {Xx~^x)~^, or equivalently, |/3 ls| < \^x~^y\~^. 
Substituting this condition into the above equation yields Eq. ( 8 ). 


Appendix B. Derivation of The Lower Bonnd of FIC 

Hayashi et al. (2015) obtained a general representation of the lower bound of FIC, and in 
this case, we have 

FIC > Eg[logp(y,Z|A,/3, A, 7 r)] - ^Eg[log|F^|] - ^ ^ \ogN + H{q), (36) 

where q = q{Z) and F/^ denotes the Hessian matrix of the negative log-likelihood w.r.t. (3. 
Note that the priors for (3 and A do not appear in the lower bound, since the priors do not 
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depend on N, as assumed in Section 4.1. In order to derive the FAB algorithm, we compute 


the lower bound of the second term on the right-hand side. 

The Hessian matrix is represented as 

F/3 = -V/3V/3logp(y,Z|A,/3,A,7r) (37) 

= {XoZ^iXoZ). (38) 

Then, using Hadamard’s inequality for a positive-semidefinite matrix yields 

-log|T)3| > -^log^x^fcZ^fc (39) 

k n 

> -^log(^z„fc)(^x2fc) (40) 

k n n 

= - ^ log ^ + const. (41) 

k ri 

As stated in Fujimaki and Morinaga (2012), since — logX^n-^^fc 1® ^ concave function, its 
linear approximation at Niik > 0 yields the lower bound: 

- Eg [log '^Znk]>-( log NlTk + ^ . (42) 

n V TTk J 


Thus, we obtain Eq. (10) in the main text. 

Appendix C. Control of Learning Coefficient 

We explain how to set the learning coefficients {r]t} in Section 5.3. We set r]t to a constant 
value rj] however, when the maximum of the update of {tt^} is greater than 0.05, rjt is 
modified such that the maximum is 0.05. We chose the constant rj as 2 x 10~‘^/N, where 
N is the number of samples. 
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